ABSTRACT

CottonGen (http://www.cottongen.org) is a curated 
and integrated web-based relational database 
providing access to publicly available genomic, 
genetic and breeding data for cotton. CottonGen 
supersedes CottonDB and the Cotton Marker 
Database, with enhanced tools for easier 
sharing, mining, visualization and retrieval of 
cotton research data. CottonGen contains 
annotated whole genome sequences, unigenes 
from expressed sequence tags (ESTs), markers, 
trait loci, genetic maps, genes, taxonomy, 
germplasm, publications and communication re- 
sources for the cotton community. Annotated 
whole genome sequences of Gossypium raimondii 
are available with aligned genetic markers and tran- 
scripts. These whole genome data can be accessed 
through genome pages, search tools and GBrowse, 
a popular genome browser. Most of the published 
cotton genetic maps can be viewed and compared 
using CMap, a comparative map viewer, and are 
searchable via map search tools. Search tools also 
exist for markers, quantitative trait loci (QTLs), 
germplasm, publications and trait evaluation data. 
CottonGen also provides online analysis tools such 
as NCBI BLAST and Batch BLAST. 

INTRODUCTION 

Cotton (Gossypium spp.) is the world's leading natural 
textile fibre crop and a significant contributor of oilseed. 
Consisting of 50 species with different levels of ploidy, 
Gossypium has long served as a model for studying funda- 
mental biological questions on genome evolution, plant 
development, polyploidization and crop productivity 
(1-5). The application of new sequencing technologies 
and high-throughput genotyping has improved under- 
standing of diploid and polyploid cotton species and has 
resulted in a wealth of genetics, genomics and breeding 
information for cotton over the last two decades. These 
publicly available resources include 49 genetic maps, 
24 000 markers, >1000 quantitative trait loci (QTLs) representing 
>30 agronomically important traits, phenotype 
data from >15 000 germplasm accessions, >650 000 NCBI 
sequences derived from 181 DNA libraries, 18 000 genes 
and gene products, 460 000 expressed sequence tags 
(ESTs) and expression data in the form of microarrays 
and RNA-Seq from high-throughput sequencing. More 
recently, two genome assemblies and annotations of 
Gossypium raimondii have become available (6,7). The 
availability of the cotton genome sequence provides a 
major source of candidate genes with potential for the 
genetic improvement of cotton quality and productivity. 
Integrating these whole genome data with other genomic 
and genetic data in an online database that is easy to 
query, view and download is essential to maximize the 
utility of these valuable research data. 

Three online databases traditionally hosted much of the 
available genomic and genetic cotton data prior to 2012. 
CottonDB (8) was founded in 1995 as part of a national 
USDA-ARS program to develop plant genome databases 
for all agricultural commodities. Using a hybrid database 
system, the genomic, genetic, taxonomic and bibliographic 
data were stored in an object-oriented AceDB database 
(9), while the genetic maps and genome sequences were 
maintained in a MySQL relational database. Initiated in 
2004, the Cotton Marker Database (CMD) (10) was 
funded by Cotton Incorporated to provide centralized 
access to all publicly available cotton simple sequence 
repeat (SSR) markers and accelerate basic and applied 
research in molecular breeding and genetic mapping. It 
used a custom MySQL database with search interfaces 
developed in the Perl programming language. The third 
database, TropGene Cotton (11), was developed as part of 
a larger project to manage genetic, molecular and pheno- 
typic data on tropical crop species. It uses a custom 
MySQL database with search interfaces developed in the 
Java programming language. The majority of public 
cotton data from TropGene was shared with CottonDB. 
CottonDB, while rich in data, was limited by older tech- 
nology, which resulted in a relatively unfriendly query 
interface and made further development difficult. CMD, 
although more user friendly, was limited primarily 
to marker data and used a custom database schema that 
limited the integration of other types of data. CottonGen, 
therefore, was created to address these limitations 
by consolidating and expanding cotton data from 
CottonDB, CMD and TropGene into a new, standards- 
based, freely accessible scientific database for worldwide 
cotton research. Another feature developed in CottonDB 
and adopted by CottonGen is the hosting of the website 
for the International Cotton Genome Initiative (ICGI). 
ICGI is a non-profit organization created in 2000 to 
increase knowledge of the structure and function of the 
cotton genome for the benefit of the global community. It 
facilitates global communication, collaboration and education; 
knowledge and resource integration; technology 
and resource development; and the coordination of research 
planning. The CottonGen team agreed to redevelop and 
host the ICGI website within CottonGen as part of its 
mission to serve as a centralized resource for the cotton 
community. 

CottonGen is developed using Tripal (12), a toolkit for 
construction of online genomic and genetic databases. 
Tripal is based on a community-derived database 
schema named Chado (13) and uses 
controlled vocabularies such as the Sequence Ontology 
(14), Gene Ontology (15) and others to standardize 
data storage. Tripal is currently used for several 
genome databases (16-21). Additionally, Tripal 
simplifies site development by combining Chado with 
Drupal (http://drupal.org), a popular web content 
management system that allows non-programmers to 
contribute content. 
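
To illustrate how a Chado-style schema ties every stored feature to a controlled-vocabulary term, the following minimal sketch uses an in-memory SQLite database. The table names `cvterm` and `feature` follow Chado, but the columns are heavily simplified for illustration, and the feature names used in the usage note are hypothetical; this is not the CottonGen schema itself.

```python
import sqlite3


def build_demo_db():
    """Create a tiny, simplified Chado-like database in memory (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cvterm (
            cvterm_id INTEGER PRIMARY KEY,
            cv_name   TEXT NOT NULL,
            name      TEXT NOT NULL,
            UNIQUE (cv_name, name)
        );
        CREATE TABLE feature (
            feature_id INTEGER PRIMARY KEY,
            uniquename TEXT NOT NULL UNIQUE,
            type_id    INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
            residues   TEXT
        );
        """
    )
    # A few Sequence Ontology term names; 'sequence' labels the vocabulary.
    conn.executemany(
        "INSERT INTO cvterm (cv_name, name) VALUES (?, ?)",
        [("sequence", "gene"), ("sequence", "mRNA"), ("sequence", "genetic_marker")],
    )
    return conn


def add_feature(conn, uniquename, type_name, residues=None):
    """Store a feature; its type must already exist in the vocabulary."""
    row = conn.execute(
        "SELECT cvterm_id FROM cvterm WHERE cv_name = 'sequence' AND name = ?",
        (type_name,),
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown feature type: {type_name}")
    conn.execute(
        "INSERT INTO feature (uniquename, type_id, residues) VALUES (?, ?, ?)",
        (uniquename, row[0], residues),
    )


def features_of_type(conn, type_name):
    """Return the unique names of all features typed with the given term."""
    rows = conn.execute(
        "SELECT f.uniquename FROM feature f "
        "JOIN cvterm t ON t.cvterm_id = f.type_id "
        "WHERE t.name = ? ORDER BY f.uniquename",
        (type_name,),
    )
    return [r[0] for r in rows]
```

For example, after `conn = build_demo_db()` and `add_feature(conn, "demo_gene_1", "gene")`, the call `features_of_type(conn, "gene")` lists that feature, while an attempt to store a feature with a type that is not in the vocabulary is rejected. Rejecting untyped data at load time is what keeps stored data types consistent across data sets.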

Migration of data from CottonDB to CottonGen was 
initiated on 1 October 2011, and CottonGen was released 
one year later, superseding CottonDB and CMD with 
additional data and enhanced functionality. As of 15 
August 2013, CottonGen includes (i) the Gossypium 
raimondii whole genome assemblies and annotation, (ii) 
annotated unigenes for the Gossypium genus, (iii) extensive 
genetic and QTL maps, markers and traits, (iv) trait evalu- 
ation data, (v) enhanced user interfaces including various 
search tools with downloadable results and (vi) resources 
to support community activities and to facilitate commu- 
nication among cotton researchers. Here we describe the 
data and the functionality in CottonGen. 

DATABASE DESCRIPTION 

CottonGen Data and Web Interface 

CottonGen contains various genetics, genomics and trait 
evaluation data including annotated whole genome 
sequences, EST sequences, markers, traits, genetic maps, 
genes, taxonomy, germplasm and publications. All 
CottonGen web pages have a common navigation menu 
for easy access. The navigation menu provides links for 
general information, data, search, tools, help and commu- 
nity resources for the ICGI. The data section lists major 
data classes in CottonGen (Table 1), such as gene, 
genome, germplasm, map, marker, publication, species 
and trait. Users can view a summary of the data, and 
various links to access the data. The search section lists 
various search tools such as for genes, germplasm, 
markers, QTL, publications and trait evaluation. Each 
search tool provides options for customization by 
applying restrictions in the query. From the search results 
page or the downloads page, users can download the complete 
data set and/or go to the various data details pages. Major 
CottonGen data and the web interface to the data are 
described below. 

Genomics data 

Whole genome sequence data 

CottonGen includes whole genome assemblies of the first 
fully sequenced cotton species, Gossypium raimondii, from two 
independent research teams (6,7). On CottonGen, these assemblies 
are titled the 'Gossypium raimondii (D5) genome JGI 
assembly v2.0 (annot v2.1)' (6) (referred to hereafter as 
the JGI version) and the 'Gossypium raimondii (D5) 
Draft Genome BGI-CGP v1.0 Assembly & Annotation' 
(7) (referred to hereafter as the draft BGI version). The 
predicted genes from these assemblies have been further 
annotated by the CottonGen team to include homology to 
proteins in other well annotated or closely related species, 
and in silico annotation of InterPro protein domains, GO 
terms and Kyoto Encyclopedia of Genes and Genomes 
database (KEGG) pathway terms, providing information 
on probable pathways and traits. Additional annotation 
by the CottonGen team includes the alignment of cotton 
genetic markers, and cotton transcripts such as 
CottonGen Unigene version v1, Udall cotton Unigene 
contigs (22), PlantGDB Cotton Unigene and NCBI 
Cotton ESTs from all major Gossypium species. Single 
nucleotide polymorphisms (SNPs) between the diploid 
genomes of A and D and those between the tetraploid 
genomes of AT and DT (T represents tetraploid) were 
also aligned to the JGI version of the G. raimondii refer- 
ence genome (23,24). The annotated sequence data can be 
accessed in CottonGen via the genome page, gene and 
sequence search tools and GBrowse (25). The genome 
pages, found under the data navigation menu, contain 
various downloadable files including the FASTA files of 
predicted gene transcripts, coding sequences (CDS) and 
predicted gene peptides. Excel files of protein homologues 
of cotton genes in other species, including those found 
in databases such as Swiss-Prot and TrEMBL (26) and 
NCBI nr (27), are also available with hyperlinks to these 
databases. Other downloadable files include ESTs and 
genetic markers in FASTA and Excel format that map 
to the whole genome sequences and functional annotation 
files from protein, InterPro and KEGG alignments. In the 
gene and sequence search tools, whole genome data can be 
found by filtering by name, GO terms, InterPro domains 
or KEGG pathway terms (28) (Figure 1). From the alignment 
page, users can go to GBrowse. Using GBrowse, site 
visitors can view genomic features aligned to the genome, 
such as gene models, repeats and SNPs, as well as alignments 
of ESTs, repeats, genetic markers and genes from other 
plant model species. Each feature in GBrowse is hyperlinked 
to a page with sequences and additional information, 
and hyperlinks to external databases where 
applicable. The chloroplast genome sequences and annotations 
of Gossypium hirsutum, Gossypium barbadense, 
Gossypium arboreum and G. raimondii are also available 
in GBrowse. 

Table 1. Number of CottonGen entries by data type (15 August 2013) 

Data type          Number of entries   Details 
BLAST              20                  5 peptide data sets, 15 nucleotide data sets (genome sequences, markers, unigenes, ESTs) for BLAST searching. 
Genome             2                   Draft BGI v1.0 and JGI annot v2.1 G. raimondii genome projects. 
Gene               119 971             1269 cotton genes from NCBI gene (06/12/2013); 40 976 and 77 726 CDS from the BGI v1.0 and JGI annot v2.1 G. raimondii genome projects, respectively, and 21 698 contigs from CottonGen Gossypium Unigene v1.0. 
Germplasm          14 959              From 14 collections. 
Marker             23 935              19 074 SSRs, 3541 RFLPs, 2146 AFLPs, 1018 SNPs and 310 other types. 
Map                49                  34 559 loci. 
QTL                988                 Representing 25 traits. 
Publication        10 731              Journal articles, conference proceedings, patents, book chapters and theses. 
Species            49                  Origin, genome group, germplasm, haploid number, sequences and libraries. 
Trait evaluation   73 296              From 6871 accessions. 

Annotated EST unigene 

CottonGen contains all Gossypium ESTs publicly avail- 
able from dbEST at NCBI as of 12 September 2012. To 
reduce inherent redundancy in ESTs and generate a 
data set representing the genes of cotton, we developed 
the CottonGen v1.0 unigene. Routine processing 
involved sequence filtering for contamination against 
the NCBI UniVec database and species-specific chloroplast, 
mitochondrial, tRNA and rRNA sequences using 
the BLAST algorithm with NCBI UniVec-recommended 
parameters; trimming of low quality sequence; assembly 
into contigs using CAP3 (29) with an overlap percentage 
parameter of 90% (-p 90); and annotation. In total, 437 185 
filtered sequences were assembled into 21 698 contigs 
and 128 218 singletons to make a unigene set of 
149 916 sequences. The CottonGen annotation procedure 
includes comparison of both the filtered ESTs and 
the EST contig consensus sequences using BLASTX 
against the SWISS-PROT, TrEMBL, InterPro, TAIR 
(30) and other well annotated species protein databases. 
The top 10 matches with an expectation value <1e-6 are 
recorded for each EST and contig. Results of in silico 
functional annotations of Gene Ontology (GO) terms 
and functional classification by pathways from KEGG 
are also recorded in the database. The 21 698 contigs 
from the v1.0 unigene can be searched using the gene 
and sequence search tools by name, InterPro domain, 
GO term, or KEGG term or gene, and the results can be 
downloaded as Excel files from the search page. The entire 
unigene data set and its annotations can also be obtained 
from the downloads page. Additional sequence annotation 
includes computational analysis of SSRs found in 
the unigene contigs using the method described in 
Jung et al. (2008). Of the 21 698 contigs, 24.6% had 
one or more SSRs, with 493 motifs detected in 6979 
SSRs. The results may be obtained from the 
downloads page as an Excel file with details for each 
SSR-containing sequence including motif, motif length, 
location in the sequence, location relative to the ORF, 
suggested primers and expected product size. 
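
The hit-recording rule in the annotation procedure above (the top 10 matches with an expectation value below 1e-6 for each EST and contig) can be sketched as follows. This is an illustrative sketch only, not the CottonGen pipeline: it assumes the BLAST results are available as 12-column tab-delimited rows (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score), and the function name `top_hits` is hypothetical.

```python
from collections import defaultdict


def top_hits(lines, max_evalue=1e-6, max_hits=10):
    """Keep up to `max_hits` best hits per query with e-value below `max_evalue`.

    `lines` is an iterable of tab-delimited BLAST rows in the assumed
    12-column layout; the e-value is column 11 and the bit score column 12.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue < max_evalue:
            hits[query].append((subject, evalue, bitscore))

    result = {}
    for query, rows in hits.items():
        # Best hit first: lowest e-value, ties broken by highest bit score.
        rows.sort(key=lambda r: (r[1], -r[2]))
        result[query] = rows[:max_hits]
    return result
```

Queries with no hit below the threshold are simply absent from the returned dictionary, so an unannotated EST or contig can be recognized by its missing key.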

NCBI genes 

All Gossypium sequences from the NCBI nucleotide 
database were downloaded, parsed for gene, mRNA, 
CDS, 5'UTR and 3'UTR features and imported to 
CottonGen. As with predicted genes from whole genome 
sequences, genes parsed from NCBI have been further 
annotated by homology to genes in other species, 
InterPro protein domains, GO terms and KEGG 
pathway terms. The distinct gene names in Gossypium 
are stored separately in the database to build a commu- 
nity-driven gene database for cotton. Each gene, unique in 
the Gossypium genus, is currently linked to all the NCBI 
genes from various species and will serve as a base entity 
to be linked to other associated data such as predicted 
genes from whole genome sequences, QTL, genetic 
markers and mutant phenotypes as annotation progresses. 
All genes and mRNAs that are parsed out from NCBI 
sequences are searchable in the gene search site. 

Map, marker and QTL data 

CottonGen provides access to the cotton genetic, QTL, and 
physical (FPC) maps, including the underlying molecular 
markers, QTL and mapping populations. For sequence- 
based markers such as SSRs, Amplified Fragment Length 
Polymorphisms (AFLPs), Sequence-Related Amplified 
Polymorphisms (SRAPs) and cDNA-Restriction Fragment 
Length Polymorphisms (RFLPs), CottonGen provides 
details on experimental conditions, such as the primer, 
amplicon-sequence information and the PCR amplification 
conditions. CottonGen currently has 49 maps, which cover 

[Figure 1 screenshots, panels A-E: InterPro search form; search results table (gene/feature name, type, accession, IPR term, source); annotated sequence view of an mRNA; alignments tab; GBrowse view.] 

Figure 1. Gene/Sequence search site in CottonGen. (A) Genes/sequences can be searched using various categories, such as by name, GO terms, 
InterPro protein domain name or KEGG pathway term. The example shows the InterPro term search site. (B) The search result page has links to the 
download, gene/sequence detail page and external database. (C) The Gene detail page has various tabs to show the data. The annotated sequence 
page is highlighted. (D) The alignment tab of the gene detail page shows the position in the whole genome with link to GBrowse. (E) The GBrowse 
page linked from the alignment tab of the gene detail page. Users can go back to the gene detail page from GBrowse. 



Gossypium genome groups AD, A, D and G, consisting of 
approximately 34000 marker loci and a thousand QTLs. 
Markers can be browsed and searched using various 
search interfaces (found under search and then markers in 
the navigation menu). All markers can be searched by 
marker source, map information or nearby loci. The 
advanced marker search interface allows researchers to 
search by various categories in combination (Figure 2). 
Researchers can also browse/search only the mapped 
markers with sequences using various categories. From 
the search result page, researchers can go to the details 
pages of markers, maps, sequences, germplasm and 
species. From the marker details page, relevant data such 
as marker source, primers, polymorphisms, map informa- 
tion and anchored position in the genome can be accessed. 
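
A search for nearby loci, as offered by the marker search described above, amounts to a distance filter within one linkage group of one map. The sketch below illustrates that idea under stated assumptions: the `Locus` record, its field names and the use of centimorgan positions are hypothetical and do not reflect how CottonGen stores or queries map data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    name: str
    map_name: str
    linkage_group: str
    position_cm: float  # assumed genetic position in centimorgans


def nearby_loci(loci, target_name, max_distance_cm):
    """Return (locus name, map, linkage group, distance) for loci near a target.

    Distances are only compared within the same map and linkage group,
    because positions on different maps are not on a common scale.
    """
    found = []
    for target in (l for l in loci if l.name == target_name):
        for other in loci:
            if other is target:
                continue
            if (other.map_name, other.linkage_group) != (
                target.map_name,
                target.linkage_group,
            ):
                continue
            distance = abs(other.position_cm - target.position_cm)
            if distance <= max_distance_cm:
                found.append(
                    (other.name, other.map_name, other.linkage_group, round(distance, 2))
                )
    return sorted(found, key=lambda row: (row[1], row[2], row[3], row[0]))
```

Because a marker can be placed on several maps, the function evaluates each placement of the target separately and reports the map alongside every neighbour.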

CottonGen houses 273 QTLs with associated data such 
as CottonGen curator-assigned QTL label, published 
symbol, trait name, alias, population, map position, 
associated markers and statistical values. The QTL 
search page allows searching for QTLs by trait name, 
published symbol and QTL label. Search results are 
hyperlinked to CMap (31) and downloadable in Excel 
format. 



Germplasm and trait evaluation data 

CottonGen includes information for each of the 50 
Gossypium species such as genome groups, geographic 
origins, inter-species compatibilities and germplasm. 
About 15 000 germplasm accessions are stored in 
CottonGen. These individuals were identified from 
>47 000 entries that consist mainly of the USDA-ARS 
Germplasm Resources Information Network (32) cotton 
collection, the cotton germplasm collection of the China 
Cotton Research Institute, the Chinese Academy of 
Agricultural Sciences and the cotton germplasm collection 
of Uzbekistan Center of Genomics and Bioinformatics, 
Academy of Sciences of Uzbekistan. Germplasm data 
include aliases, pedigrees, publicly available passport information, 
stock collection centre, associated maps, 
libraries and sequences. In addition, trait evaluation 
data, with >118 000 trait scores, from ~9000 germplasm accessions 
are available. The Gossypium species summary page 
(found under data and then species in the navigation 
menu) provides a list of species along with information 
such as genome group, haploid chromosome number 
and geographic origin. The summary of data available in 
CottonGen is also given: number of germplasm, sequences 

[Figure 2 screenshots, panels A-F: marker search menu; marker source information search with results table; marker details page for BNL0116; map positions tab; CMap view; GBrowse view.] 


Figure 2. Marker search site in CottonGen. (A) Multiple marker search sites are available based on the type of information users are interested in. 
(B) An example search interface where users can view and search for marker source information. (C) A marker details page with various links to 
detailed information. (D) The Map position tab of the marker page shows all the maps where the marker has been mapped. (E) From the marker 
page users can go to CMap. (F) For the markers that are anchored to the genome, CMap provides hyperlinks to GBrowse. From GBrowse users 
can follow the links to go back to CMap, the marker detail page or the Sequence Retrieval Tool. 



and DNA libraries. The species name in the table leads to 
a species page, which shows more details such as common 
name, images and additional data as seen in the summary 
table. The species page also shows the results of functional 
analysis of the genes, both from NCBI and whole genome 
sequences, which include KEGG and GO analysis reports. 
Several germplasm search pages provide access to differ- 
ent types of data (Figure 3). The search by collection page 
provides a list of germplasm along with stock collection 
centre information. The search can be filtered by collec- 
tion centre name, germplasm name and/or accession name 
in the stock centre. The search by pedigree page provides 
an interface to search germplasm by pedigree and the 
search germplasm by country page searches by the 
country of origin. From the germplasm search page, researchers 
can go to the germplasm details page, which 
shows detailed information such as pedigree, 
passport, collection centre, image and associated genotypic 
and phenotypic data. Germplasm can also be 
searched based on their trait evaluation data. Both the 
qualitative and quantitative trait evaluation search sites 
allow the trait values of up to three trait descriptors to 
be specified to view the germplasm trait data. Data from 
all the search result sites can be downloaded as Excel files. 
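
The trait evaluation search described above, in which values for up to three trait descriptors are specified, can be pictured as a conjunctive filter over per-accession trait records. The following sketch is a simplified illustration and not the CottonGen implementation: it combines quantitative and qualitative criteria in one function for brevity (the site provides separate search pages), and the function name and record layout are hypothetical, as are the descriptor names and values used in the usage note.

```python
def search_by_traits(records, criteria):
    """Return germplasm names whose trait values satisfy every criterion.

    records:  {germplasm name: {trait descriptor: value}}
    criteria: {trait descriptor: (low, high)} for a quantitative range
              (inclusive), or {trait descriptor: set of allowed values}
              for a qualitative trait. Between one and three descriptors
              may be given.
    """
    if not 1 <= len(criteria) <= 3:
        raise ValueError("specify between one and three trait descriptors")

    matches = []
    for name, traits in records.items():
        keep = True
        for descriptor, wanted in criteria.items():
            value = traits.get(descriptor)
            if value is None:
                keep = False  # accession was not scored for this descriptor
            elif isinstance(wanted, tuple):
                low, high = wanted
                keep = low <= value <= high
            else:
                keep = value in wanted
            if not keep:
                break
        if keep:
            matches.append(name)
    return sorted(matches)
```

For instance, with hypothetical records scored for `fiber_length_mm` and `leaf_shape`, the call `search_by_traits(records, {"fiber_length_mm": (28.0, 30.5)})` returns only accessions whose recorded fibre length lies in that range. Accessions lacking a score for any requested descriptor are excluded rather than treated as matches.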

Publications 

CottonGen houses information about publications that 
are important to cotton researchers. Details about publi- 
cations were imported to CottonGen from NCBI PubMed 
(http://www.ncbi.nlm.nih.gov/pubmed) and the USDA 
National Agricultural Library (NAL) (http://agricola.nal.usda.gov/) 
databases. Additionally, details of publications 
from other journals not present in PubMed or the 
USDA NAL databases were manually imported to 
CottonGen. In addition, CottonGen maintains reference 
information and abstracts for works published in cotton 
research conference proceedings such as the ICGI 
Conferences and the Plant and Animal Genome 
Conferences. Book chapters, theses and patents are also 
collected. In total, CottonGen houses 10 731 references. 
Publications can be found using a combination of 
keywords (in the abstract or title), all or partial titles, 

Figure 3. Germplasm search site in CottonGen. (A) Multiple germplasm search sites are available based on the type of information users are 
[...] various tabs to show the detailed information. (D) The Map tab of a germplasm page shows all the maps for which the germplasm has been used. 
(E) From the map page users can open CMap for further exploration. 



authors and other categories. Search results link to publi- 
cation pages that contain the abstract, citation, external 
link to the full article and other details about the 
publication. 

Online analysis tools 

CottonGen contains several online analysis tools. These 
include an instance of NCBI's wwwBLAST tool 
(http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/wwwblast/) 
and a custom Batch BLAST tool where users can perform 
pairwise BLAST alignments using their sequences against 
the current 20 CottonGen data sets. The Batch BLAST 
server supports upload of large data sets for pairwise comparison. 
It executes BLAST and parses the output into an 
Excel file. Users are notified by email when the job is 
complete and directed to a website to download result 
files. The same data sets are available in both BLAST 
servers for alignment. Protein data sets available for 
BLAST include Gossypium proteins from GenBank and 
UniProtKB and G. raimondii protein sequences from the 
draft BGI v1.0 and JGI v2.1 genome data. Nucleotide 
sequence databases include GenBank Gossypium 
sequences, Gossypium dbSNP, CottonGen SSR, RFLP, 
and SNP/InDel marker sequences, CottonGen 
Gossypium unigene v1.0, DFCI Cotton Gene Index v11 
(http://compbio.dfci.harvard.edu/tgi/plant.html), PlantGDB 
(http://www.plantgdb.org/) unigene from several Gossypium 
species, Udall 2012 transcript contigs and predicted genes and 
genome sequences from the BGI and JGI genome data. The 
Sequence Retrieval tool enables download of sequences 
including full chromosomes, scaffolds, genes, full transcripts, 
transcript coding sequences, proteins, genetic markers 
aligned to chromosomes, unigene contigs and ESTs. Users 
supply a list of sequence names to retrieve, and can filter by a 
specific genome assembly, unigene or other project data. For 
features aligned to a whole genome, such as genes, transcripts 
and genetic markers, a user can include a specified number of 
upstream and downstream bases in the sequence. 
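
Retrieving an aligned feature together with a specified number of upstream and downstream bases can be sketched as below. This is a minimal illustration under stated assumptions rather than the Sequence Retrieval tool itself: coordinates are taken to be 1-based and inclusive, flanks are clipped at the ends of the chromosome, and "upstream" is interpreted relative to the feature's strand (the source does not state how the tool treats strand). The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_with_flanks(chrom_seq, start, end, upstream=0, downstream=0, strand="+"):
    """Return the feature at start..end (1-based, inclusive) plus flanking bases.

    For a minus-strand feature the upstream flank lies at higher genomic
    coordinates, and the returned sequence is reverse-complemented so that
    it reads 5' to 3' along the feature.
    """
    if not 1 <= start <= end <= len(chrom_seq):
        raise ValueError("feature coordinates fall outside the sequence")
    if upstream < 0 or downstream < 0:
        raise ValueError("flank lengths must be non-negative")

    if strand == "+":
        left, right = upstream, downstream
    elif strand == "-":
        left, right = downstream, upstream
    else:
        raise ValueError("strand must be '+' or '-'")

    lo = max(1, start - left)                # clip at the chromosome start
    hi = min(len(chrom_seq), end + right)    # clip at the chromosome end
    region = chrom_seq[lo - 1:hi]
    return reverse_complement(region) if strand == "-" else region
```

For example, on the toy sequence `AAACCCGGGTTT`, a plus-strand feature at positions 4 to 6 with two upstream bases and one downstream base yields `AACCCG`; the same request on the minus strand yields `CCGGGT`.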

Community resources 

CottonGen houses the resources for the ICGI. It maintains 
the ICGI membership database and information for the 
ICGI biennial international research conferences, hosts 
the biennial elections and provides tools for registration and 
manuscript submission for the 2012 ICGI Conference. 
The CottonGen home page includes rotating pictures for 
recent research stories or community news, brief project 
descriptions, a news section for the cotton community and 
a section to quickly find newly added site functionality or 
data. Email mailing lists for both CottonGen and ICGI 
are available for communication with the community, and 
the mailing list archives can be viewed online. Other resources 
in the help section include a Frequently Asked 
Questions page for CottonGen and ICGI and user tutorials 
for both. 



FUTURE PLANS 

CottonGen will be updated as new data become available, 
and new or improved functionality will be added to the site. 
This includes adding GBrowse-syn, a GBrowse-based 
synteny browser (33), to view multiple sequence alignment 
data, synteny or co-linearity data from closely related or 
useful species such as cacao and Arabidopsis. A comprehensive 
breeders' toolbox, similar to that developed for the 
Rosaceae community as part of the USDA NIFA SCRI- 
funded project RosBREED (Grant number #2009-51181- 
06036), is planned for future implementation. In addition, 
a digital image library will be created for over one 
hundred thousand images created from the USDA-ARS 
Research Project: 'Genotypic and Phenotypic Analysis 
and Digital Imaging of Accessions in the US National 
Cotton Germplasm Collection'. The associated pheno- 
typic data will also be stored in CottonGen. 



CONCLUSION 

CottonGen is now the consolidated cotton genomics, 
genetics and breeding database for the cotton community. 
It aims to provide a comprehensive, integrated, online 
resource that serves basic, translational and applied 
cotton research. It is constructed using the open-source 
Tripal genome database toolkit, which merges the power 
of Drupal, a popular web content management system, 
with that of Chado, a community-derived database 
schema for storage of genomic and genetic data. Data 
types in CottonGen include maps and markers, whole 
genome assemblies and annotations, genes and sequences 
with analyzed data, taxonomic and germplasm data and 
publication data. CottonGen maintains online resources 
for ICGI, a non-profit organization created as a global 
affinity group with common goals and interests. From 
its release on 1 March 2012 to 15 August 2013, 
CottonGen had 11111 visits by 4756 unique visitors 
from 94 countries who accessed 75 551 pages. 